{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-K1cD9V8SDIG"
   },
   "source": [
    "# Bonito Tutorial with A100\n",
    "This tutorial runs Bonito on A100 GPUs to generate synthetic instruction tuning datasets.\n",
    "To use Bonito with A100 GPUs, you will need to purchase compute units from Google. The price starts from $9.99 for 100 compute units. See [pricing](https://colab.research.google.com/signup) for more details.\n",
    "\n",
    " If you are looking to run Bonito (for free) on the T4 GPUs, check our [quantized Bonito tutorial](https://colab.research.google.com/drive/1tfAqUsFaLWLyzhnd1smLMGcDXSzOwp9r?usp=sharing).\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Gyh5HAFxQlaH"
   },
   "source": [
    "## Setup\n",
    "First we clone into the repo and install the dependencies. This will take several minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "-lqD8IrM8Vo0"
   },
   "outputs": [],
   "source": [
    "!git clone https://github.com/BatsResearch/bonito.git\n",
    "!pip install -U bonito/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xWYY7FYfQyAD"
   },
   "source": [
    "## Load the Bonito Model\n",
    "Loads the weights from Huggingface into the Bonito class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "s5k0He_jiJeo"
   },
   "outputs": [],
   "source": [
    "from bonito import Bonito\n",
    "\n",
    "bonito = Bonito(\"BatsResearch/bonito-v1\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "86OvwN74RcS8"
   },
   "source": [
    "## Synthetic Data Generation\n",
    "Here we first show how to use the Bonito model with an unannotated text and then show how to generate instruction tuning dataset with a small unannotated dataset.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "FEAqk24gpoVO"
   },
   "source": [
    "### Single example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "cwlNfTKLCUDp",
    "outputId": "0933640c-35f8-4204-8433-df57abd9827a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('1. “Confidential Information”, whenever used in this Agreement, shall mean '\n",
      " 'any data, document, specification and other information or material, that is '\n",
      " 'delivered or disclosed by UNHCR to the Recipient in any form whatsoever, '\n",
      " 'whether orally, visually in writing or otherwise (including computerized '\n",
      " 'form), and that, at the time of disclosure to the Recipient, is designated '\n",
      " 'as confidential.')\n"
     ]
    }
   ],
   "source": [
    "from pprint import pprint\n",
    "\n",
    "unannotated_paragraph = \"\"\"1. “Confidential Information”, whenever used in this Agreement, shall mean any data, document, specification and other information or material, that is delivered or disclosed by UNHCR to the Recipient in any form whatsoever, whether orally, visually in writing or otherwise (including computerized form), and that, at the time of disclosure to the Recipient, is designated as confidential.\"\"\"\n",
    "pprint(unannotated_paragraph)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "u_xYp60oCjVz"
   },
   "source": [
    "Now generate a pair of synthetic instruction for unannotated paragraph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "k4lreUPb0LUX"
   },
   "outputs": [],
   "source": [
    "from datasets import Dataset\n",
    "from vllm import SamplingParams\n",
    "from transformers import set_seed\n",
    "\n",
    "set_seed(2)\n",
    "\n",
    "\n",
    "def convert_to_dataset(text):\n",
    "    dataset = Dataset.from_list([{\"input\": text}])\n",
    "    return dataset\n",
    "\n",
    "\n",
    "sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)\n",
    "synthetic_dataset = bonito.generate_tasks(\n",
    "    convert_to_dataset(unannotated_paragraph),\n",
    "    context_col=\"input\",\n",
    "    task_type=\"nli\",\n",
    "    sampling_params=sampling_params,\n",
    ")\n",
    "pprint(\"----Generated Instructions----\")\n",
    "pprint(f'Input: {synthetic_dataset[0][\"input\"]}')\n",
    "pprint(f'Output: {synthetic_dataset[0][\"output\"]}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2IFs82gLJJFk"
   },
   "source": [
    "Now we change the task type from NLI (nli) to multiple choice question answering (mcqa). For more details, see [supported task types](https://github.com/BatsResearch/bonito?tab=readme-ov-file#supported-task-types)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "CUtgkf8EJKxF"
   },
   "outputs": [],
   "source": [
    "set_seed(0)\n",
    "sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.7, n=1)\n",
    "synthetic_dataset = bonito.generate_tasks(\n",
    "    convert_to_dataset(unannotated_paragraph),\n",
    "    context_col=\"input\",\n",
    "    task_type=\"mcqa\",  # changed\n",
    "    sampling_params=sampling_params,\n",
    ")\n",
    "pprint(\"----Generated Instructions----\")\n",
    "pprint(f'Input: {synthetic_dataset[0][\"input\"]}')\n",
    "pprint(f'Output: {synthetic_dataset[0][\"output\"]}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "mEU1lp5TVjGj"
   },
   "source": [
    "### Small dataset\n",
    "We select 10 unannoated samples from the ContractNLI dataset and convert them into NLI instruction tuning dataset.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "qMrbj4dbC2Lm"
   },
   "outputs": [],
   "source": [
    "# load dataset with unannotated text\n",
    "from datasets import load_dataset\n",
    "\n",
    "unannotated_dataset = load_dataset(\n",
    "    \"BatsResearch/bonito-experiment\", \"unannotated_contract_nli\"\n",
    ")[\"train\"].select(range(10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HKZEbZuiGMuZ"
   },
   "source": [
    "Generate the synthetic NLI dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "52hWL50gDQnH"
   },
   "outputs": [],
   "source": [
    "# Generate synthetic instruction tuning dataset\n",
    "from pprint import pprint\n",
    "from vllm import SamplingParams\n",
    "from transformers import set_seed\n",
    "\n",
    "set_seed(42)\n",
    "\n",
    "sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)\n",
    "synthetic_dataset = bonito.generate_tasks(\n",
    "    unannotated_dataset,\n",
    "    context_col=\"input\",\n",
    "    task_type=\"nli\",\n",
    "    sampling_params=sampling_params,\n",
    ")\n",
    "pprint(\"----Generated Instructions----\")\n",
    "pprint(f'Input: {synthetic_dataset[0][\"input\"]}')\n",
    "pprint(f'Output: {synthetic_dataset[0][\"output\"]}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fBDHJVXhIXyG"
   },
   "source": [
    "Now go try it out with your own datasets! You can vary the `task_type` for different types of generated instructions.\n",
    "You can also play around the sampling hyperparameters such as `top_p` and `temperature`.\n"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "V100",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
